EDA

Some of the dataframes appear to have NaN values in them.
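One way to confirm which dataframes and columns contain NaN values is sketched below. The `frames` dictionary, its contents, and the `nan_summary` helper are hypothetical stand-ins for illustration, not the notebook's actual dataframes or code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's dataframes (assumed names and columns).
frames = {
    "orders": pd.DataFrame({
        "order_id": ["a", "b", "c"],
        "delivered_at": ["2018-01-02", np.nan, np.nan],
    }),
    "products": pd.DataFrame({
        "product_id": ["p1", "p2"],
        "category": ["bed_bath_table", np.nan],
    }),
}


def nan_summary(frames):
    """Return {dataframe name: {column: NaN count}} for columns with at least one NaN."""
    summary = {}
    for name, df in frames.items():
        counts = df.isna().sum()          # NaN count per column
        counts = counts[counts > 0]       # keep only columns that have NaNs
        if not counts.empty:
            summary[name] = {col: int(n) for col, n in counts.items()}
    return summary


print(nan_summary(frames))
```

Columns that show up here are the candidates to drop or impute during preprocessing.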

Based on the counts above, there are about 95,419 unique customers, 32,951 products, and 3,095 sellers.

Most of the customers live in Sao Paulo, followed by Rio de Janeiro at a considerable distance.

It is not surprising to see that most orders are from Sao Paulo.

Most of the orders placed are in the bed_bath_table category, followed by health_beauty, which trails by almost 2,000 orders.

It comes as no surprise that most of the sellers are from Sao Paulo, since most of the orders come from Sao Paulo as well.

As per the plot above, most of the orders are placed between 8 AM and 11 PM, with 4 PM being the most active hour.
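The hourly counts behind a plot like this can be derived as in the following sketch. The sample rows, the `order_purchase_timestamp` column name, and the `orders_per_hour` helper are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample of orders; the column name is an assumption.
orders = pd.DataFrame({
    "order_purchase_timestamp": [
        "2018-03-01 16:05:00",
        "2018-03-01 16:45:00",
        "2018-03-02 09:10:00",
        "2018-03-03 23:30:00",
    ]
})


def orders_per_hour(df, column="order_purchase_timestamp"):
    """Count orders for each hour of the day (0-23), including hours with no orders."""
    hours = pd.to_datetime(df[column]).dt.hour
    return hours.value_counts().reindex(range(24), fill_value=0)


counts = orders_per_hour(orders)
busiest_hour = int(counts.idxmax())  # hour of day with the most orders
print(busiest_hour)
```

Reindexing over all 24 hours keeps empty hours visible as zeros, so a bar plot of `counts` has a complete x-axis.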

Data Preprocessing

K-Means Clustering

As per the elbow method, the number of clusters is set to 5.
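For reference, the elbow method fits K-Means for a range of cluster counts and looks for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply. The sketch below uses synthetic data from `make_blobs` in place of the notebook's preprocessed features, and `inertia_curve` is a hypothetical helper.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the preprocessed feature matrix (an assumption).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=42)


def inertia_curve(X, k_values, random_state=42):
    """Fit K-Means once per k and collect the resulting inertia values."""
    inertias = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(X)
        inertias.append(model.inertia_)
    return inertias


k_values = list(range(1, 11))
inertias = inertia_curve(X, k_values)

# Plotting k_values against inertias shows the "elbow" where the curve flattens.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

The elbow is read off the plot by eye, so the chosen value is a judgment call rather than a computed optimum.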

Above is the scatter plot of the clusters, projected with t-SNE, a tool for visualizing high-dimensional data. t-SNE converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. Its cost function is not convex, so different initializations can produce different results.

If the number of features is very high, it is highly recommended to first reduce it to a reasonable amount (e.g. 50) with another dimensionality reduction method (e.g. PCA for dense data or TruncatedSVD for sparse data) before running t-SNE. This suppresses some noise and speeds up the computation of pairwise distances between samples.
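A minimal sketch of that two-step reduction with scikit-learn follows. The random matrix `X` is a hypothetical stand-in for a high-dimensional feature matrix, and the parameter values are illustrative rather than the ones used in this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical dense feature matrix: 200 samples, 100 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))

# Step 1: PCA down to 50 dimensions to suppress noise and cut the cost
# of the pairwise distance computation.
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# Step 2: t-SNE down to 2 dimensions for plotting. Fixing random_state makes
# the run repeatable, since the non-convex cost depends on initialization.
embedding = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(X_reduced)

print(X_reduced.shape, embedding.shape)
```

The two columns of `embedding` can then be passed to a scatter plot and colored by cluster label.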

PCA

DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It finds core samples of high density and expands clusters from them, which makes it a good fit for data containing clusters of similar density.
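The sketch below illustrates this behavior with scikit-learn's `DBSCAN` on synthetic data: two dense blobs plus two isolated points. The data and the `eps` and `min_samples` values are assumptions chosen for the example, not the settings used in this analysis. Points that belong to no cluster receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: two dense blobs and two isolated outliers (illustrative only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0, 0), scale=0.2, size=(100, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.2, size=(100, 2))
outliers = np.array([[2.5, 2.5], [10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outliers])

# eps is the neighborhood radius; min_samples is the number of neighbors
# (including the point itself) required for a point to count as a core sample.
model = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = model.labels_

# Label -1 marks noise, so it is excluded from the cluster count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Unlike K-Means, the number of clusters is not set in advance; it follows from `eps` and `min_samples`.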

Above is the 3D scatter plot, built with Plotly, which shows the 4 clusters and the noise points (purple markers) scattered across the axes.